The Gaussian distribution is the most well-known and widely used distribution in many fields such
as engineering, statistics and physics. One of the major reasons why the Gaussian distribution
has become so prominent is the Central Limit Theorem (CLT), together with the fact that
the distribution of noise in numerous engineering systems is well captured by the Gaussian
distribution. Moreover, features such as analytical tractability and the easy generation of other
distributions from the Gaussian distribution have contributed further to its popularity.
In particular, when there is no information about the distribution of the observations,
the Gaussian assumption appears as the most conservative choice. This follows from the fact that,
for a given covariance, the Gaussian distribution minimizes the Fisher information, which is the
inverse of the Cramer-Rao lower bound (CRLB) (or, equivalently stated, the Gaussian distribution
maximizes the CRLB). Therefore, any optimization based on the CRLB under the Gaussian assumption can be
considered to be min-max optimal in the sense of minimizing the largest CRLB (see [1] and the
references cited therein).

Inspired by the early isoperimetric inequality for entropy introduced by Costa and Cover
[2] and the more recent results of Rioul [3] and of Stoica and Babu [1], the goals of this paper are
threefold: i) to illustrate a connection between [1] and the recent information theoretic results
reported in [2], [3]; ii) to present information theoretic and estimation theoretic justifications for
the fact that the Gaussian assumption leads to the largest CRLB; and iii) to show a slight extension
of this result to the more general framework of correlated observations. Even though Stoica and
Babu provided a simple and quite general proof of the result that the largest CRLB is achieved
by the Gaussian distribution, their proof is only applicable to the situation in which the
observations are independent, i.e., the observation noise is white [1]. However, this result can
be generalized to arbitrary correlations among samples. In many practical circumstances,
correlated noise is inevitable since the observed data comes from a filter, and the filter
introduces correlation. This generalization is therefore of practical importance. The
result is also closely related to two well-known results in information theory: first, the fact
that a Gaussian random vector maximizes the differential entropy for a given covariance matrix,
and second, the worst additive noise lemma (see [3], [4], and the references cited therein).
Several researchers have investigated
relationships between estimation theoretic (statistical) concepts such as mean-square error and
Fisher information and information theoretic concepts such as entropy and mutual information
(see e.g., [2], [3] and the references cited therein). However, most of these results tend
to be theoretical rather than practical. In this paper, we show how some of these results can be
applied to a more practical problem: the estimation of a communication channel via
a training sequence.

I. RELEVANCE 

The approach introduced in this paper can be adapted to optimally estimate unknown (deterministic
or random) parameters in additive noise channels. As presented in the channel model
(1), the additive noise channel is very general in the sense that the only assumption is the
independence between the data and the noise w. Namely, the channel model does not require the
Gaussian noise assumption, it admits correlation among noise terms, and it also allows for
correlation among data terms. Therefore, the proposed approach can be used broadly in signal
processing applications involving parameter estimation, spectrum estimation, and optimization,
as well as in wireless communications and information theory. This lecture note is also
beneficial to courses related to such topics.

II. PREREQUISITES 

The readers may require some knowledge about linear algebra, elementary probability theory, 
statistical signal processing, and basic information theory. 

III. PROBLEM STATEMENT 
Consider a random vector $\mathbf{y}$ which is generated by the following system of equations:

$$\mathbf{y} = \mathbf{x}_{\theta} + \mathbf{w}, \qquad (1)$$

where $\mathbf{y}$ is an $n \times 1$ observed random vector, $\mathbf{x}_{\theta}$ denotes an $n \times 1$ signal (random) vector
which depends on a $k \times 1$ unknown deterministic parameter vector $\boldsymbol{\theta}$, and $\mathbf{w}$ stands for the
$n \times 1$ zero-mean noise vector whose covariance matrix is $\boldsymbol{\Sigma}_w$. The random vectors $\mathbf{x}_{\theta}$ and $\mathbf{w}$ are
assumed independent of each other. The systems represented by the channel model (1) are quite
numerous. In particular, the channel model (1) might consist of the samples of an arbitrary
stochastic process such as ARMA (autoregressive moving average) or ARMAX (ARMA with
exogenous inputs), as mentioned in [1].

Based on the channel model (1), we define the score function:

$$\mathbf{s}(\boldsymbol{\theta}) = \nabla_{\theta} \log f_{y|x_\theta}(\mathbf{y}|\mathbf{x}_{\theta}), \qquad (2)$$

where $\nabla_{\theta}$ denotes the gradient with respect to $\boldsymbol{\theta}$, and $f_{y|x_\theta}(\mathbf{y}|\mathbf{x}_{\theta})$ is the conditional density
function of $\mathbf{y}$ given $\mathbf{x}_{\theta}$. The Cramer-Rao lower bound (CRLB) is expressed by the diagonal
elements of the inverse of the Fisher information matrix (FIM), and the FIM is represented as:

$$\mathbf{J}_{\theta}(\mathbf{y}) = E_{y}[\mathbf{s}(\boldsymbol{\theta})\mathbf{s}(\boldsymbol{\theta})^T], \qquad (3)$$

where the notation $E_{y}[\cdot]$ stands for the expectation with respect to a random vector $\mathbf{y}$, and
superscript $T$ denotes the operation of transposition for a vector or matrix.
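
The definitions (2) and (3) can be checked numerically in a simple special case. The following is a minimal sketch, assuming a scalar location model y = theta + w with white Gaussian noise of standard deviation sigma (a case chosen here purely for illustration; the function names are hypothetical). The score is then (y - theta)/sigma^2, and the Monte Carlo average of its square should approach 1/sigma^2.

```python
import numpy as np

def score_gaussian_location(y, theta, sigma):
    """Score d/dtheta log f(y | theta) for y = theta + w, w ~ N(0, sigma^2)."""
    return (y - theta) / sigma**2

def monte_carlo_fim(theta, sigma, num_samples=200_000, seed=0):
    """Estimate the scalar FIM E[s(theta)^2] by averaging over simulated observations."""
    rng = np.random.default_rng(seed)
    y = theta + sigma * rng.standard_normal(num_samples)
    s = score_gaussian_location(y, theta, sigma)
    return float(np.mean(s**2))

fim_hat = monte_carlo_fim(theta=1.5, sigma=2.0)
crlb_hat = 1.0 / fim_hat  # CRLB for a single observation, close to sigma^2
```

For sigma = 2 the estimate should be close to the closed-form value 1/sigma^2 = 0.25, so the single-sample CRLB is close to 4.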

Our goal is to find an optimal estimator for the parameter in the sense that the estimated 
parameter minimizes the lower bound of the mean square error of the estimator in the worst 
case scenario. 

IV. MINIMUM FISHER INFORMATION- A STATISTICAL VIEWPOINT 

One of the common approaches to estimate unknown parameters is to build estimators that
minimize the Cramer-Rao lower bound. Since the CRLB is expressed as the inverse of the FIM,
minimizing the Cramer-Rao lower bound is equivalent to maximizing the FIM. Given the channel model
(1), the score function in (2) and the FIM in (3) can be re-expressed by the following procedure.

Since $f_{y|x_\theta}(\mathbf{y}|\mathbf{x}_{\theta}) = f_{w}(\mathbf{w})|_{\mathbf{w}=\mathbf{y}-\mathbf{x}_{\theta}} = f_{w}(\mathbf{y} - \mathbf{x}_{\theta})$, where $f_{w}(\cdot)$ denotes the density function
of the noise $\mathbf{w}$, and $\mathbf{x}_{\theta}$ and $\mathbf{w}$ are independent of each other, using the chain rule for computing
the derivative of a function, the score function $\mathbf{s}(\boldsymbol{\theta})$ is re-written as:

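$$
\begin{aligned}
\mathbf{s}(\boldsymbol{\theta}) &= \nabla_{\theta} \log f_{y|x_\theta}(\mathbf{y}|\mathbf{x}_{\theta}) \\
&= \nabla_{\theta} \log f_{w}(\mathbf{y} - \mathbf{x}_{\theta}) \\
&= -\nabla_{\theta}\mathbf{x}_{\theta} \nabla_{w} \log f_{w}(\mathbf{w}),
\end{aligned} \qquad (4)
$$

where the gradient (Jacobian) of the vector $\mathbf{x}_{\theta}$ is defined as the $k \times n$ matrix $\nabla_{\theta}\mathbf{x}_{\theta}$ with its
$(i,j)$th entry equal to $\partial x_{\theta,j} / \partial \theta_i$. Now it turns out that the FIM (3) can be expressed as:

$$
\begin{aligned}
\mathbf{J}_{\theta}(\mathbf{y}) &= E_{x_\theta, w}\left[\left(\nabla_{\theta}\mathbf{x}_{\theta} \nabla_{w} \log f_{w}(\mathbf{w})\right)\left(\nabla_{\theta}\mathbf{x}_{\theta} \nabla_{w} \log f_{w}(\mathbf{w})\right)^T\right] \\
&= E_{x_\theta, w}\left[\nabla_{\theta}\mathbf{x}_{\theta} \left(\nabla_{w} \log f_{w}(\mathbf{w}) \nabla_{w} \log f_{w}(\mathbf{w})^T\right) \nabla_{\theta}\mathbf{x}_{\theta}^T\right] && (5) \\
&= E_{x_\theta}\left[\nabla_{\theta}\mathbf{x}_{\theta} \mathbf{J}(\mathbf{w}) \nabla_{\theta}\mathbf{x}_{\theta}^T\right], && (6)
\end{aligned}
$$

where the FIM with respect to $\mathbf{w}$ is defined as

$$\mathbf{J}(\mathbf{w}) = E_{w}\left[\nabla_{w} \log f_{w}(\mathbf{w}) \nabla_{w} \log f_{w}(\mathbf{w})^T\right]. \qquad (7)$$

In equation (5), the expectation with respect to both $\mathbf{x}_{\theta}$ and $\mathbf{w}$ can be separated into the outer
expectation with respect to $\mathbf{x}_{\theta}$ and the inner expectation with respect to $\mathbf{w}$ since $\mathbf{x}_{\theta}$ and $\mathbf{w}$ are
independent of each other. When the vector $\mathbf{x}_{\theta}$ is deterministic, the outer expectation is not
required. Therefore, the term related to the random vector $\mathbf{w}$ becomes the FIM, $\mathbf{J}(\mathbf{w})$, defined
in equation (7), and it is not affected by the outer expectation $E_{x_\theta}[\cdot]$ in equation (6).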
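
The factorization (6) can be illustrated as follows. This is a minimal sketch, assuming a deterministic linear signal x_theta = H theta (a hypothetical model, so the Jacobian is simply H^T) and white Gaussian noise, for which the noise FIM is the inverse covariance; the helper name `fim_additive_noise` is not from the original text.

```python
import numpy as np

def fim_additive_noise(jacobian, noise_fim):
    """Equation (6) for deterministic x_theta: J_theta(y) = G J(w) G^T, with G of size k x n."""
    return jacobian @ noise_fim @ jacobian.T

n, sigma = 8, 0.5
H = np.ones((n, 1))            # hypothetical model x_theta = H theta with k = 1
G = H.T                        # Jacobian grad_theta x_theta, shape (k, n)
J_w = np.eye(n) / sigma**2     # FIM of white Gaussian noise: inverse covariance
J_theta = fim_additive_noise(G, J_w)
crlb = np.diag(np.linalg.inv(J_theta))  # CRLB: diagonal of the inverse FIM
```

In this toy case J_theta equals n/sigma^2 and the CRLB equals sigma^2/n, the familiar variance of the sample mean.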

The following result states that the FIM $\mathbf{J}(\mathbf{w})$, which is a positive semi-definite matrix, is
lower-bounded by the FIM $\mathbf{J}(\mathbf{w}_G)$ of a normally distributed random vector $\mathbf{w}_G$.

Lemma 1 (Cramer-Rao Inequality): For a random vector $\mathbf{w}$ and a Gaussian random vector $\mathbf{w}_G$
whose covariance matrix $\boldsymbol{\Sigma}_w$ is identical to the covariance matrix of $\mathbf{w}$, the following inequality
is satisfied:

$$\mathbf{J}(\mathbf{w}) \succeq \mathbf{J}(\mathbf{w}_G),$$

where the notation $\succeq$ stands for "greater than or equal to", in the sense of the partial ordering of
positive semi-definite matrices.

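Proof: The proof essentially follows [3]. First, we define the following two score functions:

$$\mathbf{s}_{w}(\mathbf{w}) = \nabla_{w} \log f_{w}(\mathbf{w}), \qquad \mathbf{s}_{w_G}(\mathbf{w}) = \nabla_{w} \log f_{w_G}(\mathbf{w}). \qquad (8)$$

The covariance matrix of the difference of the two score functions (8) is expressed as

$$E_{w}\left[\left(\mathbf{s}_{w}(\mathbf{w}) - \mathbf{s}_{w_G}(\mathbf{w})\right)\left(\mathbf{s}_{w}(\mathbf{w}) - \mathbf{s}_{w_G}(\mathbf{w})\right)^T\right], \qquad (9)$$

and it is always greater than or equal to the zero matrix in terms of the positive semi-definite
partial ordering. Notice further that (9) can be simplified to

$$
\begin{aligned}
E_{w}\left[\left(\mathbf{s}_{w}(\mathbf{w}) - \mathbf{s}_{w_G}(\mathbf{w})\right)\left(\mathbf{s}_{w}(\mathbf{w}) - \mathbf{s}_{w_G}(\mathbf{w})\right)^T\right]
&= \mathbf{J}(\mathbf{w}) - E_{w}[\mathbf{s}_{w}(\mathbf{w})\mathbf{s}_{w_G}(\mathbf{w})^T] - E_{w}[\mathbf{s}_{w_G}(\mathbf{w})\mathbf{s}_{w}(\mathbf{w})^T] + \mathbf{J}(\mathbf{w}_G) \\
&= \mathbf{J}(\mathbf{w}) - \mathbf{J}(\mathbf{w}_G).
\end{aligned} \qquad (10)
$$

Since $\mathbf{w}_G$ is a zero-mean Gaussian random vector, $\mathbf{s}_{w_G}(\mathbf{w}) = -\boldsymbol{\Sigma}_w^{-1}\mathbf{w}$. The fourth term in (10)
is therefore $\boldsymbol{\Sigma}_w^{-1} E_{w}[\mathbf{w}\mathbf{w}^T] \boldsymbol{\Sigma}_w^{-1} = \boldsymbol{\Sigma}_w^{-1} = \mathbf{J}(\mathbf{w}_G)$, because $\mathbf{w}$ and $\mathbf{w}_G$ share the same covariance
matrix. Also, $E_{w}[\mathbf{s}_{w}(\mathbf{w})\mathbf{s}_{w_G}(\mathbf{w})^T] = -\int \left(\nabla_{w} f_{w}(\mathbf{w})\right)\mathbf{w}^T d\mathbf{w}\, \boldsymbol{\Sigma}_w^{-1} = \int f_{w}(\mathbf{w}) d\mathbf{w}\, \boldsymbol{\Sigma}_w^{-1} = \boldsymbol{\Sigma}_w^{-1}$
by Green's identity (see e.g., [2] and the references cited therein). Here, Green's identity plays
the role of integration by parts for a vector. Since $\mathbf{J}(\mathbf{w}_G) = \boldsymbol{\Sigma}_w^{-1}$, the last equality in
equation (10) is verified. Since a covariance matrix is always positive semi-definite, from equation (10),

$$E_{w}\left[\left(\mathbf{s}_{w}(\mathbf{w}) - \mathbf{s}_{w_G}(\mathbf{w})\right)\left(\mathbf{s}_{w}(\mathbf{w}) - \mathbf{s}_{w_G}(\mathbf{w})\right)^T\right] = \mathbf{J}(\mathbf{w}) - \mathbf{J}(\mathbf{w}_G) \succeq \mathbf{0}. \qquad (11)$$

Therefore, the proof is completed. ■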
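
Lemma 1 can be checked numerically in the scalar case. The sketch below is illustrative only: it approximates the location Fisher information by quadrature on a uniform grid and compares a unit-variance Gaussian with a unit-variance logistic density. The logistic density is an example chosen here, not one used in the original text, and the function names are hypothetical.

```python
import numpy as np

def location_fisher_information(pdf, grid):
    """Approximate J(w) = integral of f'(w)^2 / f(w) dw on a uniform grid."""
    f = pdf(grid)
    df = np.gradient(f, grid)          # finite-difference derivative of the density
    step = grid[1] - grid[0]
    return float(np.sum(df**2 / f) * step)

def gaussian_pdf(w, sigma=1.0):
    return np.exp(-0.5 * (w / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def logistic_pdf(w, sigma=1.0):
    s = sigma * np.sqrt(3.0) / np.pi   # logistic scale giving variance sigma^2
    return 1.0 / (4.0 * s * np.cosh(w / (2.0 * s)) ** 2)

grid = np.linspace(-12.0, 12.0, 20001)
J_gauss = location_fisher_information(gaussian_pdf, grid)
J_logistic = location_fisher_information(logistic_pdf, grid)
```

With equal variances, the Gaussian value is close to 1 while the logistic value is close to pi^2/9, which is larger, as the lemma predicts.
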
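Due to Lemma 1, when $\mathbf{w}$ is a Gaussian random vector, the FIM $\mathbf{J}(\mathbf{w})$ is minimized, and
consequently the FIM $\mathbf{J}_{\theta}(\mathbf{y})$ is also minimized:

$$
\begin{aligned}
\mathbf{J}_{\theta}(\mathbf{y}) &= E_{x_\theta}\left[\nabla_{\theta}\mathbf{x}_{\theta} \mathbf{J}(\mathbf{w}) \nabla_{\theta}\mathbf{x}_{\theta}^T\right] \\
&\succeq E_{x_\theta}\left[\nabla_{\theta}\mathbf{x}_{\theta} \mathbf{J}(\mathbf{w}_G) \nabla_{\theta}\mathbf{x}_{\theta}^T\right] \\
&= \mathbf{J}_{\theta}(\mathbf{y}_G),
\end{aligned} \qquad (12)
$$

where $\mathbf{y}_G = \mathbf{x}_{\theta} + \mathbf{w}_G$, and the equality holds if and only if $\mathbf{w}$ is normally distributed. The inequality
in equation (12) is due to the fact that for an arbitrary matrix $\mathbf{C}$, the inequality $\mathbf{C}\mathbf{A}\mathbf{C}^T \succeq \mathbf{C}\mathbf{B}\mathbf{C}^T$
holds whenever positive semi-definite matrices $\mathbf{A}$ and $\mathbf{B}$ satisfy $\mathbf{A} \succeq \mathbf{B}$.

From equations (6) and (12), we know that the CRLB depends on the parameter $\boldsymbol{\theta}$ only through
the FIM, $\mathbf{J}(\mathbf{w})$. In other words, the CRLB only depends on $\mathbf{J}(\mathbf{w})$ when $\mathbf{x}_{\theta}$ is fixed. Therefore,
the Gaussian random vector $\mathbf{w}_G$ maximizes the CRLB (or, equivalently, minimizes the FIM
$\mathbf{J}_{\theta}(\mathbf{y})$) when $\mathbf{x}_{\theta}$ is fixed. Therefore, any design which optimizes the FIM (6) (or equivalently
the CRLB) when the random vector $\mathbf{w}$ is Gaussian can be considered min-max optimal in the
sense of generating the smallest FIM (or the largest CRLB) in the worst situation.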
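
The ordering in (12) can be illustrated for correlated noise. The sketch below assumes a hypothetical construction w = A u, where u has i.i.d. unit-variance Laplacian entries and A is an invertible coloring matrix. Under this assumption the FIM transforms as J(w) = A^{-T} J(u) A^{-1} with J(u) = 2I, so J(w) = 2 Sigma_w^{-1}, while the Gaussian noise with the same covariance has J(w_G) = Sigma_w^{-1}. The Jacobian G is a random placeholder.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 5, 2
# Hypothetical coloring filter: lower triangular with positive diagonal, hence invertible.
A = np.tril(rng.standard_normal((n, n)), -1) + np.diag(1.0 + rng.random(n))
Sigma_w = A @ A.T                    # covariance of w = A u
J_wG = np.linalg.inv(Sigma_w)        # Gaussian noise: J(w_G) = Sigma_w^{-1}
J_w = 2.0 * J_wG                     # colored Laplacian noise (see assumptions above)
G = rng.standard_normal((k, n))      # placeholder Jacobian grad_theta x_theta

J_theta = G @ J_w @ G.T              # FIM under the non-Gaussian noise
J_theta_G = G @ J_wG @ G.T           # FIM under Gaussian noise, same covariance
gap_eigs = np.linalg.eigvalsh(J_theta - J_theta_G)
crlb = np.diag(np.linalg.inv(J_theta))
crlb_G = np.diag(np.linalg.inv(J_theta_G))
```

All eigenvalues of the gap are non-negative and every Gaussian CRLB entry is at least as large as its non-Gaussian counterpart, which is the worst-case property used in the min-max argument.
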
V. MINIMUM MUTUAL INFORMATION- AN INFORMATION THEORETIC VIEWPOINT

It is well-known that, given the covariance matrix, a Gaussian random vector minimizes the
FIM, a result referred to as the Cramer-Rao inequality (see [1], [3], and the references cited
therein). On the other hand, a Gaussian random vector maximizes the differential entropy when
the covariance matrix is given (see [3], [5], and the references cited therein). These two results
are closely related to each other. First, consider this relationship for random variables. Given a
random variable $w$ and a Gaussian random variable $w_G$, the following inequalities are satisfied:

• $J(w) \geq J(w_G)$ when $N(w) = N(w_G)$,

• $N(w) \geq N(w_G)$ when $J(w) = J(w_G)$,

where $N(\cdot)$ denotes the entropy power of a random variable, and $J(\cdot)$ stands for the Fisher
information of a random variable. The above inequalities are easily derived from the general
inequality

$$N(w)J(w) \geq 1, \qquad (13)$$

where the equality holds if and only if $w$ is Gaussian. The inequality (13) is referred to as the
isoperimetric inequality for entropies (see [2], [6], and the references cited therein).
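
As a quick sanity check of (13), consider a Laplacian random variable. The worked example below assumes the usual definition of entropy power, $N(w) = e^{2h(w)}/(2\pi e)$ with the differential entropy $h(w)$ measured in nats; this definition is not stated explicitly in the text, and the scale parameter $b$ is introduced here only for illustration.

```latex
% Laplacian density with scale b > 0 (illustrative choice)
f_w(w) = \frac{1}{2b}\, e^{-|w|/b}, \qquad
h(w) = 1 + \ln(2b), \qquad
J(w) = E\!\left[\left(\frac{d}{dw}\ln f_w(w)\right)^{2}\right] = \frac{1}{b^{2}},
\\[4pt]
N(w) = \frac{e^{2h(w)}}{2\pi e} = \frac{4b^{2}e^{2}}{2\pi e} = \frac{2e\,b^{2}}{\pi},
\qquad
N(w)\,J(w) = \frac{2e}{\pi} \approx 1.73 \;\geq\; 1 .
\\[4pt]
% Gaussian case for comparison: equality in (13)
N(w_G) = \sigma^{2}, \qquad J(w_G) = \frac{1}{\sigma^{2}}, \qquad N(w_G)\,J(w_G) = 1 .
```

The product does not depend on the scale $b$, and the bound is met with equality only in the Gaussian case.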

When the variance of $w$ is equal to the variance of $w_G$, the inequality $J(w) \geq J(w_G)$ can
be derived from $N(w) \leq N(w_G)$ using the isoperimetric inequality for entropies. However,
we cannot derive the inequality $N(w) \leq N(w_G)$ from $J(w) \geq J(w_G)$ using the isoperimetric
inequality. Instead, the worst additive noise lemma (see e.g., [3], [4], [7] and the references
cited therein) can be derived from the inequality $J(w) \geq J(w_G)$ when the variances of $w$ and
$w_G$ are identical. All the relationships mentioned above are also valid for random vectors if we
substitute either $|\mathbf{J}(\cdot)|^{1/n}$ or $\mathrm{Tr}\{\mathbf{J}(\cdot)\}$ for $J(\cdot)$. The trace and the determinant of a matrix are
represented by the notations $\mathrm{Tr}\{\cdot\}$ and $|\cdot|$, respectively. Since the vector generalization is quite
direct, these results are not mentioned here except the following lemma.

Lemma 2 (Worst Additive Noise Lemma [4], [7]): For a random vector $\mathbf{w}$ and a Gaussian
random vector $\mathbf{w}_G$ whose covariance matrices are identical to each other,

$$I(\mathbf{w} + \mathbf{z}_G; \mathbf{z}_G) \geq I(\mathbf{w}_G + \mathbf{z}_G; \mathbf{z}_G), \qquad (14)$$

where $I(\cdot\,;\cdot)$ stands for mutual information, $\mathbf{z}_G$ is a Gaussian random vector with zero mean and
covariance matrix $\boldsymbol{\Sigma}_z$, and all random vectors are independent of one another.

Similar to the Cramer-Rao inequality (see [1], [3], and Lemma 1), the worst additive noise
lemma shows that the mutual information $I(\mathbf{w} + \mathbf{z}_G; \mathbf{z}_G)$ is minimized when $\mathbf{w}$ is Gaussian.
Let the notation $h(\cdot)$ stand for differential entropy, and define the function:

$$g(\boldsymbol{\Sigma}_z) = h(\mathbf{w} + \mathbf{z}_G) - h(\mathbf{w}_G + \mathbf{z}_G) - h(\mathbf{w}) + h(\mathbf{w}_G). \qquad (15)$$

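The function $g(\cdot)$ is non-decreasing with respect to the covariance matrix $\boldsymbol{\Sigma}_z$ near the zero matrix
$\mathbf{0}$. This is because $g(\mathbf{0}) = 0$ and, due to Lemma 2, $g(\boldsymbol{\Sigma}_z)$ is always non-negative for a covariance
matrix $\boldsymbol{\Sigma}_z$ which is arbitrarily close to the zero matrix $\mathbf{0}$. Therefore, near the zero matrix, the first derivative
of $g(\boldsymbol{\Sigma}_z)$ with respect to $\boldsymbol{\Sigma}_z$ is always positive semi-definite, and using a vector version of de
Bruijn's identity [8], the Cramer-Rao inequality is derived from Lemma 2 as follows:

$$
\begin{aligned}
& \nabla_{\Sigma_z} g(\boldsymbol{\Sigma}_z)\big|_{\Sigma_z = 0} \succeq \mathbf{0} \\
\Longleftrightarrow\; & \nabla_{\Sigma_z} I(\mathbf{w} + \mathbf{z}_G; \mathbf{z}_G)\big|_{\Sigma_z = 0} - \nabla_{\Sigma_z} I(\mathbf{w}_G + \mathbf{z}_G; \mathbf{z}_G)\big|_{\Sigma_z = 0} \succeq \mathbf{0} \\
\Longleftrightarrow\; & \mathbf{J}(\mathbf{w}) - \mathbf{J}(\mathbf{w}_G) \succeq \mathbf{0},
\end{aligned} \qquad (16)
$$

where $\Longleftrightarrow$ stands for equivalence.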
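
The link between the derivative of the mutual information at zero and the FIM can be checked in the Gaussian case, where everything is available in closed form. The following is a minimal sketch, assuming the standard expression I(w_G + z_G; z_G) = 0.5 [log det(Sigma_w + Sigma_z) - log det(Sigma_w)] in nats; the covariance values and the perturbation direction D are arbitrary illustrative choices.

```python
import numpy as np

def gaussian_mutual_information(Sigma_w, Sigma_z):
    """I(w_G + z_G; z_G) in nats for independent zero-mean Gaussian vectors."""
    _, logdet_sum = np.linalg.slogdet(Sigma_w + Sigma_z)
    _, logdet_w = np.linalg.slogdet(Sigma_w)
    return 0.5 * (logdet_sum - logdet_w)

Sigma_w = np.array([[2.0, 0.6], [0.6, 1.0]])
J_wG = np.linalg.inv(Sigma_w)              # FIM of the Gaussian noise
eps = 1e-6
D = np.array([[1.0, 0.0], [0.0, 0.0]])     # perturb Sigma_z along its (0, 0) entry
# One-sided finite difference of the mutual information at Sigma_z = 0.
slope = gaussian_mutual_information(Sigma_w, eps * D) / eps
```

The slope agrees with 0.5 times the (0, 0) entry of J(w_G), which is consistent with the factor of 2 that relates the FIM to the gradient of the mutual information.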

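Therefore, in equation (6), the FIM $\mathbf{J}_{\theta}(\mathbf{y})$ is expressed as

$$
\begin{aligned}
\mathbf{J}_{\theta}(\mathbf{y}) &= E_{x_\theta}\left[\nabla_{\theta}\mathbf{x}_{\theta} \mathbf{J}(\mathbf{w}) \nabla_{\theta}\mathbf{x}_{\theta}^T\right] \\
&= 2 E_{x_\theta}\left[\nabla_{\theta}\mathbf{x}_{\theta} \left(\nabla_{\Sigma_z} I(\mathbf{w} + \mathbf{z}_G; \mathbf{z}_G)\big|_{\Sigma_z = 0}\right) \nabla_{\theta}\mathbf{x}_{\theta}^T\right],
\end{aligned} \qquad (17)
$$

the smallest FIM, $\mathbf{J}_{\theta}(\mathbf{y}_G)$ with $\mathbf{y}_G = \mathbf{x}_{\theta} + \mathbf{w}_G$, in (12) is expressed as

$$\mathbf{J}_{\theta}(\mathbf{y}_G) = 2 E_{x_\theta}\left[\nabla_{\theta}\mathbf{x}_{\theta} \left(\nabla_{\Sigma_z} I(\mathbf{w}_G + \mathbf{z}_G; \mathbf{z}_G)\big|_{\Sigma_z = 0}\right) \nabla_{\theta}\mathbf{x}_{\theta}^T\right], \qquad (18)$$

and

$$
E_{x_\theta}\left[\nabla_{\theta}\mathbf{x}_{\theta} \left(\nabla_{\Sigma_z} I(\mathbf{w} + \mathbf{z}_G; \mathbf{z}_G)\big|_{\Sigma_z = 0}\right) \nabla_{\theta}\mathbf{x}_{\theta}^T\right]
\succeq
E_{x_\theta}\left[\nabla_{\theta}\mathbf{x}_{\theta} \left(\nabla_{\Sigma_z} I(\mathbf{w}_G + \mathbf{z}_G; \mathbf{z}_G)\big|_{\Sigma_z = 0}\right) \nabla_{\theta}\mathbf{x}_{\theta}^T\right]. \qquad (19)
$$

Therefore, one can do the min-max optimal design based on equations (17), (18), and (19).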
VI. PRACTICAL APPLICATIONS

The min-max approach can be applied to many problems. A typical example
is the optimal training sequence design for estimating frequency-selective fading channels [9],
[10]. In contrast to what was shown in [9], [10], the proposed approach requires
neither the assumption of Gaussian noise nor the white noise assumption.

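Assume that a linearly modulated signal filtered through a frequency-selective channel is
modeled as follows:

$$\mathbf{y} = \mathbf{X}_{\omega}\mathbf{S}\mathbf{h} + \mathbf{w}, \qquad (20)$$

where $\mathbf{y} = [y_0, \cdots, y_{n-1}]^T$, $\mathbf{w} = [w_0, \cdots, w_{n-1}]^T$, $\mathbf{h} = [h_0, \cdots, h_{m-1}]^T$,

$$
\mathbf{X}_{\omega} =
\begin{bmatrix}
1 & & & \\
 & e^{i\omega} & & \\
 & & \ddots & \\
 & & & e^{i(n-1)\omega}
\end{bmatrix},
\qquad
\mathbf{S} =
\begin{bmatrix}
s_0 & s_{-1} & \cdots & s_{1-m} \\
s_1 & s_0 & \cdots & s_{2-m} \\
\vdots & \vdots & & \vdots \\
s_{n-1} & s_{n-2} & \cdots & s_{n-m}
\end{bmatrix},
\qquad (21)
$$

$\omega = 2\pi f$ is the frequency offset, $\{s_{1-m}, \ldots, s_{n-1}\}$ stands for the training sequence samples,
and $\{h_0, \cdots, h_{m-1}\}$ denote the taps of the channel impulse response, assumed of finite length
$m$. The noise $\mathbf{w}$ is an arbitrary random vector with zero mean and noise covariance matrix $\boldsymbol{\Sigma}_w$.

Since we want to find the optimal training sequences to estimate the channel impulse response
and the frequency offset, we first define the unknown parameter vector as $\boldsymbol{\theta} = [\omega, \mathbf{h}_R^T, \mathbf{h}_I^T]^T$, where
$\mathbf{h}_R$ and $\mathbf{h}_I$ denote the real and the imaginary parts of the channel $\mathbf{h}$.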
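
The structure of (20) and (21) can be sketched in code. This is an illustrative construction only: the function names, the BPSK training symbols, and the Laplacian noise are assumptions made here, and the training samples are stored in the order s_{1-m}, ..., s_{n-1}.

```python
import numpy as np

def offset_matrix(omega, n):
    """X_omega = diag(1, e^{i omega}, ..., e^{i (n-1) omega})."""
    return np.diag(np.exp(1j * omega * np.arange(n)))

def training_matrix(s, n, m):
    """n x m matrix with entry (r, c) equal to s_{r-c}; s holds s_{1-m}, ..., s_{n-1}."""
    s = np.asarray(s)
    if s.shape[0] != n + m - 1:
        raise ValueError("need n + m - 1 training samples")
    rows = np.arange(n)[:, None]
    cols = np.arange(m)[None, :]
    return s[rows - cols + (m - 1)]   # shift by m - 1 so that s_{1-m} sits at index 0

n, m, omega = 6, 3, 0.2
rng = np.random.default_rng(2)
s = rng.choice([-1.0, 1.0], size=n + m - 1)               # hypothetical BPSK training sequence
h = rng.standard_normal(m) + 1j * rng.standard_normal(m)  # hypothetical channel taps
w = 0.1 * rng.laplace(size=n)                             # the noise need not be Gaussian
S = training_matrix(s, n, m)
y = offset_matrix(omega, n) @ S @ h + w
```

Each row of S holds the m most recent training samples, so S h is the convolution of the training sequence with the channel taps.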

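Based on equation (6),

$$
\begin{aligned}
\mathbf{J}_{\theta}(\mathbf{y}) &= \mathrm{Re}\left[\nabla_{\theta}\boldsymbol{\xi}_{\theta} \mathbf{J}(\mathbf{w}) \nabla_{\theta}\boldsymbol{\xi}_{\theta}^H\right] && (22) \\
&\succeq \mathrm{Re}\left[\nabla_{\theta}\boldsymbol{\xi}_{\theta} \mathbf{J}(\mathbf{w}_G) \nabla_{\theta}\boldsymbol{\xi}_{\theta}^H\right] && (23) \\
&\succeq \mathrm{Re}\left[\nabla_{\theta}\boldsymbol{\xi}_{\theta} (\lambda_{\min}\mathbf{I}) \nabla_{\theta}\boldsymbol{\xi}_{\theta}^H\right] && (24) \\
&= \lambda_{\min} \mathrm{Re}\left[\nabla_{\theta}\boldsymbol{\xi}_{\theta} \nabla_{\theta}\boldsymbol{\xi}_{\theta}^H\right], && (25)
\end{aligned}
$$

where $\boldsymbol{\xi}_{\theta} = \mathbf{X}_{\omega}\mathbf{S}\mathbf{h}$, $\lambda_{\min}$ represents the minimum eigenvalue of the FIM $\mathbf{J}(\mathbf{w}_G)$, $\mathrm{Re}[\cdot]$ denotes
the real part of a vector or matrix, and superscript $H$ stands for Hermitian transposition. Since $\boldsymbol{\xi}_{\theta}$
is a complex-valued function which only depends on the unknown deterministic real parameters,
in equation (22), the equality holds with $\mathrm{Re}[\cdot]$ and without the expectation. Due to Lemma
1, equation (23) is verified, and equation (24) is satisfied due to the eigenvalue decomposition.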
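
The step from (23) to (25) can be checked numerically. The sketch below is an illustration under stated assumptions: the Jacobian is built for theta = [omega, Re(h), Im(h)] by direct differentiation of xi = X_omega S h, the training matrix is a random placeholder rather than a Toeplitz matrix built from an actual sequence, the noise covariance is arbitrary, and the FIM form Re[G J G^H] is taken as written in (22) without additional scaling. All function names are hypothetical.

```python
import numpy as np

def jacobian_xi(omega, h, S):
    """k x n Jacobian of xi = X_omega S h w.r.t. theta = [omega, Re(h), Im(h)], k = 2m + 1."""
    n = S.shape[0]
    XS = np.exp(1j * omega * np.arange(n))[:, None] * S   # X_omega S without forming the diagonal matrix
    d_omega = 1j * np.arange(n) * (XS @ h)                # derivative of xi with respect to omega
    return np.vstack([d_omega[None, :], XS.T, (1j * XS).T])

def real_fim(G, J_noise):
    """Re[G J G^H], the form used in equations (22)-(25)."""
    return np.real(G @ J_noise @ G.conj().T)

rng = np.random.default_rng(3)
n, m = 8, 2
S = rng.choice([-1.0, 1.0], size=(n, m))      # placeholder training matrix
h = np.array([1.0 + 0.5j, -0.3 + 0.2j])       # hypothetical channel taps
B = rng.standard_normal((n, n))
Sigma_w = B @ B.T + n * np.eye(n)             # arbitrary positive definite noise covariance
J_wG = np.linalg.inv(Sigma_w)                 # FIM of Gaussian noise with this covariance
lam_min = np.linalg.eigvalsh(J_wG)[0]         # smallest eigenvalue of J(w_G)
G = jacobian_xi(0.1, h, S)
gap = real_fim(G, J_wG) - lam_min * real_fim(G, np.eye(n))
gap_eigs = np.linalg.eigvalsh((gap + gap.T) / 2.0)
```

The eigenvalues of the gap are non-negative up to rounding, so the bound in (25) is indeed a lower bound on (23) in the positive semi-definite ordering.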

Equation (25) reveals the smallest FIM. It generates the worst CRLB, and it is exactly of 
the same form as the one shown in [9]. Using the same argument as in [9], the white training 
sequence is min-max optimal in this case. This min-max approach heavily depends on how much 
information we have about the unknown parameters. If we know the distribution of the noise 
vector w, then the min-max approach will be adopted based on equation (22), while equation 
(23) will be used when we only know the covariance matrix of the noise vector w. In both cases, 
the white training sequences are not optimal since the optimal design is affected by the FIM, 
J(w), which is related to the correlation of w. The optimal sequences may depend on either 
the noise distribution or, at least, the noise covariance matrix. However, without any information 
about the noise vector w, the white training sequences are optimal in the sense of minimizing 
the worst CRLB. 

The presented result, i.e., that for a colored noise $\mathbf{w}$ with a given correlation matrix the FIM $\mathbf{J}_{\theta}(\mathbf{y})$
is minimized when the random vector $\mathbf{w}$ is Gaussian, can also be interpreted from a different
standpoint as follows. In equation (1), assume $\mathbf{y}$ is passed through a whitening filter, and a new
signal $\tilde{\mathbf{y}}$ is obtained. The noise present in the new output $\tilde{\mathbf{y}}$ is white since the correlation of the
noise is eliminated by the whitening filter. Therefore, we can directly adopt the method proposed
in [9]. However, the design of the whitening filter requires the covariance matrix of the noise $\mathbf{w}$.
If we have information about the covariance matrix of $\mathbf{w}$, we can construct the optimal training
sequences; if we do not have information about $\mathbf{w}$, we have to follow the method proposed in
equations (24) and (25), and use the fact that the covariance matrix is lower-bounded by the
minimum eigenvalue of the covariance matrix multiplied by the identity matrix.
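
The whitening step can be sketched as follows, assuming the noise covariance is known and positive definite. A Cholesky-based filter is one common choice, used here for illustration; the text does not prescribe a particular whitening filter, and the covariance values are hypothetical.

```python
import numpy as np

def whitening_filter(Sigma_w):
    """Return W = L^{-1} with Sigma_w = L L^T (Cholesky), so that cov(W w) = I."""
    L = np.linalg.cholesky(Sigma_w)
    return np.linalg.solve(L, np.eye(L.shape[0]))

# Hypothetical AR(1)-type noise covariance with correlation coefficient 0.8.
Sigma_w = np.array([[1.0, 0.8, 0.64],
                    [0.8, 1.0, 0.8],
                    [0.64, 0.8, 1.0]])
W = whitening_filter(Sigma_w)
whitened_cov = W @ Sigma_w @ W.T   # covariance of the noise after filtering
```

Applying W to the observation vector leaves a noise term with identity covariance, at the cost of requiring Sigma_w in advance.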

VII. WHAT WE HAVE LEARNED 

The results provided in the previous sections show that, given the covariance matrix $\boldsymbol{\Sigma}_w$, the FIM
$\mathbf{J}_{\theta}(\mathbf{y})$ (CRLB) is minimized (respectively, maximized) by adopting the Gaussian assumption.
This fact leads to the min-max optimal approach in the following sense: the FIM $\mathbf{J}_{\theta}(\mathbf{y})$ (CRLB)
depends on the unknown parameters only through the FIM $\mathbf{J}(\mathbf{w})$. Since the Gaussian noise (not
necessarily white) minimizes the FIM $\mathbf{J}(\mathbf{w})$, it also minimizes the FIM $\mathbf{J}_{\theta}(\mathbf{y})$ (or equivalently,
it maximizes the CRLB). Therefore, the optimal design under the Gaussian assumption yields
the best CRLB in the worst case. The CRLB can also be expressed using the mutual information.
From the information theoretic viewpoint, the fact that a Gaussian random vector minimizes the
FIM given the covariance matrix is related to the worst additive noise lemma and the fact that
a Gaussian random vector maximizes the differential entropy given the covariance matrix.